
BMC Bioinformatics

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match BMC Bioinformatics's content profile, based on 383 papers previously published here. The average preprint has a 0.37% match score for this journal, so anything above that is already an above-average fit.

1
cran2crux: automatically create CRUX ports for R-packages

Petrov, P.; Izzi, V.

2026-05-13 bioinformatics 10.64898/2026.05.09.723963 medRxiv
Top 0.1%
23.3%

Motivation: R, together with CRAN and Bioconductor, provides one of the richest ecosystems for bioinformatics and computational biology, with thousands of specialized packages. While GNU/Linux is a widely used operating system in this field, R packages are typically managed independently of the system's native package manager. This separation makes installation, updates and mass rebuilds cumbersome. CRUX, a minimalist semi-source GNU/Linux distribution, offers great flexibility with its ports-based system for the seamless integration of R packages with its native package manager. Results: The cran2crux tool presented here automatically generates CRUX ports for packages from both CRAN and Bioconductor. It performs recursive dependency resolution, handles naming conventions, extracts dependency information, and supports inclusion of optional dependencies. The tool also provides convenient functions for checking updates and regenerating outdated ports. It can generate over 140 ports for complex packages such as Seurat in approximately 11 seconds, dramatically simplifying the maintenance of large R-dedicated repositories on CRUX. Availability: cran2crux is available under the MIT license at https://github.com/izzilab/cran2crux. As of now, more than 650 R package ports, generated with the tool, are available in the CRUX ports database.

2
An assessment of normalization and differential expression methods for miRNA-seq analysis using a realistic benchmark dataset

Aparicio-Puerta, E.; Baran, A. M.; Ashton, J. M.; Pritchett, E. M.; Gaca, A.; Becker, J.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.

2026-05-13 bioinformatics 10.64898/2026.05.08.723923 medRxiv
Top 0.2%
22.9%

MicroRNAs are short noncoding RNAs that regulate gene expression and are commonly profiled by small RNA sequencing (miRNA-seq). Despite the widespread use of miRNA-seq, datasets are often analyzed with RNA-seq methods such as DESeq2 or edgeR, which do not take into account the specific characteristics of miRNA-seq data. Here, we present a benchmark study of normalization and differential expression approaches using a realistic ground-truth dataset. By mixing mouse RNA from two organs, we generated expression trends while capturing biological and technical variability. Using monotonicity across the dataset and expected fold changes from the mixture design, we assessed normalization and differential expression methods. Normalization benchmarking showed that within-sample scaling, particularly Reads Per Million (RPM), best preserved the expected monotonic trends, outperforming cross-sample methods such as TMM, rlog, and VST. These approaches sometimes recovered apparent monotonicity among abundant miRNAs, but inspection of individual profiles suggested likely over-correction. Regarding differential expression, edgeR consistently ranked among the best-performing methods across several metrics, including log2 fold-change estimation, with performance comparable to miRNA-seq-specific tools such as miRglmm and NBSR. DESeq2, edgeR-v4, and limma-based approaches tended to systematically underestimate log2 fold changes. Applying a common RPM-based normalization substantially improved the performance of cross-sample methods, highlighting the strong influence of normalization on differential expression analysis. Overall, our findings support within-sample scaling methods such as RPM for normalization, and edgeR, miRglmm, or NBSR for differential expression. The dataset has been made publicly available, providing a valuable resource for objective method comparison and future miRNA-seq software development.
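
To make the within-sample scaling mentioned in this abstract concrete, here is a minimal sketch of reads-per-million normalization in Python. It is an illustration only, not the authors' pipeline: the function name `rpm_normalize` and the toy counts are assumptions made for this example.

```python
def rpm_normalize(counts):
    """Scale one sample's raw read counts to reads per million (RPM).

    `counts` maps a miRNA name to its raw read count in a single sample.
    Only that sample's own total is used, so this is within-sample scaling.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {name: c * 1_000_000 / total for name, c in counts.items()}


# Hypothetical toy counts for one sample (not taken from the preprint).
sample = {"miR-21": 500, "miR-122": 300, "let-7a": 200}
rpm = rpm_normalize(sample)
```

Because each sample is divided by its own library size, the result for one sample does not depend on any other sample, in contrast to cross-sample approaches such as TMM.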

3
Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier

Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.

2026-05-18 bioinformatics 10.64898/2026.05.15.725340 medRxiv
Top 0.2%
22.7%

Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean area under the receiver operating characteristic curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem.
Author Summary: Protein-protein interactions are fundamental in all biological processes. However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model, which automatically learned the primary sequence and its related characteristics, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." Then we performed a systematic comparison of amino acid frequency features alone with hybrid features across four different machine learning classifiers. Our results suggest that the random forest classifier performed best among the four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.

4
Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites

Kravchenko, P.; Vorontsov, I. E.; Makeev, V. J.; Kulakovskiy, I. V.; Penzar, D. D.

2026-05-14 bioinformatics 10.64898/2026.05.12.724515 medRxiv
Top 0.2%
22.3%

Motivation: DNA motifs recognised by transcription factors are typically represented as position weight matrices (PWMs), assuming independent contributions of individual nucleotides to protein binding specificity. Many alternative models accounting for correlations of positional contributions have been introduced in the past decades. However, performance gains have generally not outweighed the advantages of simplicity, interpretability, and practical applicability of PWMs with the well-established codebase. Existing software tools and motif databases provide multiple non-identical PWMs for the same transcription factor or even for the same dataset. It remains a practical question whether these PWMs can be effectively combined into a single improved model. Results: Here we describe ArChIPelago (https://github.com/autosome-ru/ArChIPelago), a computational framework that combines multiple PWMs into a joint model using classic machine learning techniques, from linear regression to ensembles of decision trees. We show that such a combination improves prediction of transcription factor binding sites in genomic sequences. With a diverse collection of 704 ChIP-Seq datasets spanning 36 orthologous human and mouse transcription factors of diverse structural families, we show that ArChIPelago consistently outperforms the best available individual mono- and dinucleotide PWMs as well as sparse local inhomogeneous mixture models. Furthermore, using both human and mouse data, we demonstrate that PWM ensembles are capable of making reliable cross-species predictions.

5
A Conditional Random Field approach for de novo reconstruction of bacterial haplotypes from a de Bruijn graph representation

Steyaert, A.; Van Hecke, M.; Marchal, K.; Fostier, J.

2026-05-12 bioinformatics 10.64898/2026.05.11.724222 medRxiv
Top 0.2%
19.6%

Background: Detecting distinct bacterial strains in a mixed sample is an important, yet less well-developed aspect of metagenomic research. Several methods exist that successfully retrieve a de novo reconstruction of viral strains. However, the reconstruction of bacterial haplotypes poses its own distinct challenges, and methods that successfully reconstruct full genome-length bacterial strains de novo are scarce. Here, we develop HaploDetox, a method for de novo bacterial haplotype reconstruction from short reads. We use a de Bruijn graph representation of the reads in which nodes correspond to k-mers from the read set and arcs represent overlap between two nodes' sequences. Our aim is to accurately assign labels to each node and arc in the graph to reveal the presence or absence of their corresponding sequence in individual strains. Results: Using a negative binomial mixture model, we model the relationship between the read coverage of nodes and arcs in the graph and their presence in a strain. We achieve improved labelling accuracy by including contextual information from neighbouring nodes and arcs with a Conditional Random Field. These labels are used to extract strain-specific de Bruijn graphs from the original graph. Additionally, we allow users to assess the number of strains present in the dataset based on model selection criteria. We evaluate our node/arc labelling accuracy on simulated datasets and in silico mixes of real datasets containing different numbers of strains, as well as on in vitro mixed real datasets. Existing de novo haplotype reconstruction methods present their reconstruction as strain-specific sets of SNPs. By aligning the unitigs from strain-specific graphs to a reference genome, we demonstrate that HaploDetox assigns strain-specific SNPs with higher recall than, and similar precision to, existing methods. Conclusions: We achieve improved strain-specific SNP phasing accuracy compared to existing methods for de novo bacterial haplotype reconstruction. Additionally, HaploDetox is not limited to the determination of strain-specific SNPs, and other types of variant calls can be obtained through reference alignment. Finally, strain-specific de Bruijn graphs are an important first step towards full genome-length bacterial haplotype-aware assembly.
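
The graph representation described in this abstract (k-mer nodes, arcs for overlaps between consecutive k-mers, read coverage on both) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not HaploDetox's implementation: the function name `build_de_bruijn` is hypothetical, reverse complements are ignored, and the reads are toy data.

```python
from collections import Counter


def build_de_bruijn(reads, k):
    """Count coverage of k-mer nodes and of arcs between consecutive k-mers.

    Two k-mers that are adjacent in a read overlap by k-1 characters,
    which is the overlap an arc represents in this sketch.
    """
    node_cov = Counter()
    arc_cov = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        node_cov.update(kmers)
        arc_cov.update(zip(kmers, kmers[1:]))
    return node_cov, arc_cov


# Hypothetical toy reads (not from the preprint).
reads = ["ACGTAC", "CGTACG"]
nodes, arcs = build_de_bruijn(reads, 3)
```

The per-node and per-arc counts returned here are the kind of coverage values that a mixture model could then relate to presence or absence in a strain.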

6
geneML: Gene annotation across diverse fungal species using deep learning

Vader, L.; Harvey, C. J.; Weber, T.; Hon, L. S.

2026-05-21 bioinformatics 10.64898/2026.05.18.725946 medRxiv
Top 0.2%
19.3%

Accurate gene prediction remains a major bottleneck in fungal genomics, where lineage diversity and alternative splicing challenge existing ab initio methods. Here, we present geneML, a deep learning-based gene prediction tool tailored to fungal genomes. Across nine reference genomes spanning diverse fungal taxa, geneML improved gene-level F1 score from 64.9 to 67.1 compared to BRAKER3 with protein-based hints, driven by substantially higher recall (69.0 vs. 64.1) at equivalent precision. geneML also remains fast, averaging around 6 minutes per genome on a standard 8-core CPU. A key feature of geneML is its ability to predict alternative transcripts. Evaluated against Fusarium graminearum Iso-Seq control data, it achieves 41.1% transcript recall and 71.1% precision, outperforming AUGUSTUS (33.8% recall, 48.9% precision), one of the few tools that support isoform prediction. The predicted transcript diversity is consistent with experimentally observed fungal alternative splicing patterns. Reannotation of the curated training dataset further suggests improved biological completeness, with geneML recovering 15.3% more genes containing complete PFAM domains than the reference annotation. These results demonstrate that geneML enables faster, more sensitive, and more biologically informative fungal genome annotation. geneML is available as an open-source command-line tool at https://github.com/hexagonbio/geneML.
Key Points:
- geneML improves gene prediction accuracy over both classical and recent deep learning-based methods, while substantially improving recall.
- geneML predicts alternative transcripts with higher precision and recall than AUGUSTUS, expanding functional annotation.
- Runtime was reduced 32-fold relative to BRAKER3, enabling efficient high-throughput genome annotation.
- geneML identifies novel genes and recovers missing annotations, especially in under-annotated non-Ascomycete genomes.

7
BAT: an integrated pipeline for gene tree construction, annotation, and functional inference

Sheppard, B. D.; Behnken, B.; Steinbrenner, A.

2026-05-12 bioinformatics 10.64898/2026.05.07.721474 medRxiv
Top 0.2%
19.0%

Gene family functional exploration often requires analyzing motifs, domains, and associated datasets (e.g. gene expression) in the phylogenetic context of a gene tree. As genomic resources become more abundant, local pipelines are needed to analyze gene families of interest with project-specific resources. Here we present BLAST-Align-Tree (BAT), a bioinformatic pipeline for automated gene family phylogeny construction and annotation to enable gene tree exploration. BAT combines a BLAST search of local genome databases with a robust and flexible gene tree construction pipeline that enables multiple modes of annotation. Output visualizations display experimental datasets, custom regex-specified amino acid motifs, and protein HMM domain annotations. For flexibility, BAT runs locally and is independent of pre-existing databases, allowing the easy incorporation of custom genomes and datasets. Three primary case studies described here demonstrate the utility of BAT for inferring the function of homologs and orthologs within characterized gene families. BAT is suitable for fine-scale phylogenomic analysis of gene families across the tree of life, and default genomes available on installation span model eukaryotes.

8
On the benchmarking of clustering algorithms and hyperparameter influence for cell type detection in single-cell RNA sequencing data.

Szmigiel, A.; Gesteira Costa Filho, I.; Campello, R. J. G. B.

2026-05-17 bioinformatics 10.1101/2025.08.20.671270 medRxiv
Top 0.3%
18.7%

Clustering single-cell RNA-seq (scRNA-seq) data and related protocols remains a major challenge due to high dimensionality, sparsity, and noise. Despite numerous benchmarking studies aiming to identify the most suitable clustering methods, many suffer from methodological flaws that can undermine their conclusions. A major challenge in benchmarking is selecting representative datasets that cover the diversity of scRNA-seq experiments and include laboratory-verified labels for reliable evaluation. Consistent preprocessing of all inputs to benchmarked algorithms is crucial, as it significantly impacts performance. Beyond selecting an algorithm, a thorough exploration of hyperparameters is also essential to assess robustness and identify configurations that maximize performance. We focus on proposing an improved benchmarking framework that addresses common methodological issues in prior studies. We illustrate our proposed methodology in a case study comparing the classic Leiden and Louvain clustering algorithms with extensive hyperparameter exploration on a carefully curated collection of real gold standard datasets. By evaluating clustering performance across different hyperparameter selection scenarios, we show that benchmarking results can be misleading, either overestimating or underestimating performance depending on how the hyperparameter space is explored. In our illustrative case study, benchmarking results do not reveal any practically relevant performance differences between the Louvain and Leiden algorithms. In contrast, we show that overlooked factors such as graph construction and quality functions critically influence clustering outcomes, particularly under suboptimal settings of numerical hyperparameters: the neighborhood size k used for similarity graph construction and the resolution hyperparameter in graph-based clustering algorithms. While noticeable trends have been observed in terms of how different (dis)similarity functions affect performance, the impact of this choice is limited and, to some extent, overridden by the graph-building approach. Across different graphs, there is a noticeable trade-off between achieving optimal performance with ideally tuned numerical hyperparameters and maintaining robustness under more realistic, unsupervised, and suboptimal settings. All in all, the analysis of our illustrative benchmarking case study offers clear guidance and objective recommendations for practitioners in the field. Most importantly, as the main contribution of this manuscript, our proposed framework sets a foundation for more reliable scRNA-seq clustering evaluation and benchmarking in future studies.

9
VX: an AI-enabled desktop genome viewer and transcriptome browser with a programmable analysis framework

Shirokikh, N. E.; Cleynen, A.

2026-05-20 bioinformatics 10.64898/2026.05.17.725790 medRxiv
Top 0.3%
18.6%

Background: Genome and transcriptome browsers are central to the interpretation of high-throughput sequencing data, but today's tools assume a human operator at a graphical interface and offer only limited programmability. As large-language-model assistants become routine in bioinformatics [Anthropic, 2024], this creates a bottleneck: agents cannot observe the visual state of the browser or drive it through the same interface as the human user, and analyses remain fragmented across a separate ecosystem of external tools. Transcript-coordinate data, produced by ribosome profiling [Ingolia et al., 2012] and direct RNA sequencing [Garalde et al., 2018], is also awkwardly supported in chromosome-oriented viewers. Results: We present VX, a desktop genome and transcriptome viewer written in D, using GTK 3 and OpenGL, that handles genome-scale and transcriptome-scale data in a unified interface. VX exposes its full functionality through an embedded HTTP API on the loopback interface and a Model Context Protocol server of currently thirty-nine tools, so that scripts and LLM agents can load data, navigate, manage tracks, run analyses, and capture figures through the same contract used by the GUI. An integrated analysis framework provides more than fifty analyses and includes signal processing and peak calling, quantification, variant analysis, alignment statistics, interaction and cross-track comparisons, all with an explicit four-level scope hierarchy running from viewport to whole dataset; results are written to disk and, where appropriate, added as new tracks. Additional features include a magnifier popup for base-resolution inspection (Alt+hover), chromosome-alias resolution across UCSC, Ensembl, and NCBI conventions, viewport video recording via an ffmpeg pipe, and INI-based configuration. Conclusions: VX complements existing desktop and web browsers by providing a native agent-control layer, an integrated analysis framework, and first-class transcript-space handling. The binary is freely available for non-commercial use; the HTTP API and MCP protocol are fully specified in this article, so third-party clients can be written independently of the core implementation.

10
StabCell: Stability selection for clustering and marker detection in single-cell RNA sequencing

Lück, N.; Rossi, A.; Staerk, C.

2026-05-12 bioinformatics 10.64898/2026.05.07.720061 medRxiv
Top 0.3%
18.6%

Motivation: Conventional pipelines for differential expression analysis in single-cell RNA sequencing (scRNA-seq) data first cluster individual cells and then test for differentially expressed genes between the resulting clusters. Using the same data for clustering and testing, however, poses a selective inference problem and can result in overconfidence in differences that may not reflect true biological variation. Results: We introduce StabCell, a stability selection framework which integrates clustering and detection of differentially expressed marker genes. By repeatedly performing clustering and differential expression analysis on complementary random subsamples, StabCell assesses clustering and marker stability, yielding a stable clustering with sets of stable marker genes. In simulations, we demonstrate that StabCell provides approximate empirical per-family error rate (PFER) control, selecting fewer false positive marker genes compared with conventional approaches, especially in cases with low signal-to-noise ratio and low sequencing depth. Applying the method to a cell differentiation dataset from induced pluripotent stem cells (iPSCs) to cardiomyocytes reveals that meaningful marker genes are consistently among the top-ranked genes. These results indicate that StabCell can improve the interpretability and robustness of scRNA-seq analyses. Availability and implementation: An implementation of StabCell in the statistical programming language R is available at https://github.com/LuckyLueck/StabCell. Code to reproduce the results is available at https://github.com/LuckyLueck/StabCell_paper.

11
RAPID: an interactive R/Shiny platform for end-to-end 16S rRNA and ITS amplicon sequence analysis using DADA2

Kapoor, B.; Cregger, M. A.; Ranjan, P.

2026-05-08 bioinformatics 10.64898/2026.05.05.723040 medRxiv
Top 0.3%
18.5%

Motivation: Amplicon sequencing of 16S rRNA and internal transcribed spacer (ITS) gene regions is the most widely used approach for characterizing bacterial and fungal communities, respectively. The DADA2 pipeline has become a standard for inferring amplicon sequence variants (ASVs), offering single-nucleotide resolution over traditional OTU clustering. However, executing the full DADA2 workflow requires proficiency in R programming and manual coordination of multiple sequential steps, presenting a substantial barrier for researchers in clinical, environmental, and agricultural sciences who lack computational training. Results: We present RAPID (R-based Amplicon Pipeline for Interactive DADA2), a pair of R/Shiny applications providing complete graphical user interfaces for 16S rRNA and ITS amplicon sequence analysis. The 16S application implements a 10-step guided workflow from raw paired-end FASTQ files through quality filtering, error learning, dereplication, paired-read merging, chimera removal, taxonomy assignment (SILVA), phyloseq construction with data transformation (rarefaction, relative abundance, or CLR), interactive visualization (rarefaction curves, alpha diversity, NMDS, PCoA, taxonomic abundance), PERMANOVA, and ANCOM-BC2 differential abundance analysis. The ITS application extends this to an 11-step workflow, adding an automated primer removal step using cutadapt with support for multiple primers and length-variable amplicons, and uses the UNITE database for fungal taxonomy. Both applications feature asynchronous background processing, session persistence, real-time progress monitoring, publication-ready figure export, and comprehensive result downloads. Availability: RAPID is freely available at https://github.com/beantkapoor786/RAPID. Both applications can be installed locally on any system with R (version 4.0 or higher) and run as local web applications accessible through a standard browser.

12
PARiS: Probabilistic Assignment and Repartitioning of isomiR Sequences: A data-driven method for denoising isomiR read count data

Swan, H. K.; Baran, A. M.; Aparicio-Puerta, E.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.

2026-05-12 bioinformatics 10.64898/2026.05.09.723882 medRxiv
Top 0.3%
18.3%

MicroRNAs (miRNAs) are non-coding RNAs, approximately 18-24 nucleotides in length, with important gene regulatory functions. In small RNA sequencing (sRNA-seq), observed isoforms of miRNA, called isomiRs, arise from many biological and technical processes. Alterations in isomiR expression have been linked to a wide variety of human diseases, from cancers to neurological diseases. However, it is difficult to distinguish between technical and biological isomiRs. We present PARiS, an algorithm for the Probabilistic Assignment and Repartitioning of isomiR Sequences, that identifies technical error isomiRs in sRNA-seq data and reassigns them to their most likely biological source. We assess the ability of PARiS to identify and remove error isomiR sequences in a realistic simulation study. Additionally, we compare PARiS to alternative approaches, focusing on downstream miRNA-level differential expression analysis in a variety of settings, including a set of simulated datasets, an experimental benchmark dataset, and three colorectal adenocarcinoma cell lines.

13
On the state of protein function prediction: a report on the fourth CAFA challenge

Ramola, R.; De Paolis Klauza, M. C.; Piovesan, D.; Peng, Y.; Joshi, P.; Mehdiabadi, M.; Quaglia, F.; Pancsa, R.; Chemes, L. B.; Ahmadi, M.; Ahn, H.; Altenhoff, A. M.; Asgari, E.; Aspromonte, M. C.; Atalay, V.; Babbi, G.; Baldazzi, D.; Barot, M. M.; Ben-Hur, A.; Benso, A.; Berenberg, D.; Bjorne, J.; Boecker, F.; Boldi, P.; Bonello, J.; Bordin, N.; Borole, P.; Ebrahimpour Boroojeny, A.; Cao, R.; Di Carlo, S.; Casadio, R.; Casiraghi, E.; Chang, J.-M.; Chen, C.; Chen, T.-M.; Cheng, J.; Chiu, S.; Dalkiran, A.; Davidovic, R. S.; Dessimoz, C.; Diao, R.; Djeddi, W. E.; Dogan, T.; Flannery, S. T.; Font

2026-05-11 bioinformatics 10.64898/2026.05.06.722942 medRxiv
Top 0.4%
17.9%

Background: The Critical Assessment of Functional Annotation (CAFA) is a community effort held to understand the field of computational protein function prediction. Every three years, since 2010, the organizers initiate an experiment to collect function predictions on a large set of proteins and then evaluate the performance of predicting methods on a subset of proteins that have accumulated experimental annotations between the submission deadline and the evaluation time. CAFA provides an independent and rigorous assessment of the current state of the art, thus leveling the playing field, highlighting successes, revealing bottlenecks, and offering a forum for the exchange of ideas in protein science. Here, we report the results of the fourth CAFA experiment (CAFA4). Results: CAFA4 featured the participation of 148 methods from 70 research groups on a total of 46,205 unique proteins over a 5-year annotation accumulation phase, the longest in any CAFA. In a comparison across CAFA2-CAFA4 methods, the prediction of Gene Ontology (GO) terms has clearly improved across all three GO aspects and traditional evaluation settings. While not achieving the first rank, several CAFA2 and CAFA3 methods featured in the top ten methods in many evaluations, suggesting that earlier methods still hold relevance. The performance is weaker in the newly introduced "partial knowledge" evaluation category (proteins with experimental annotations before submission deadline that gained additional annotations in the same GO aspect during the annotation accumulation phase), highlighting the need for a new class of methods. The rankings of the methods were stable over the years in traditional evaluation settings, but less so in the new partial knowledge evaluation. Overall, the field continues to progress with some influx of new participants. Sustained efforts will be necessary to substantially advance it.

14
MIMOSA: A model-independent framework for transcription factor binding site motif similarity assessment

Tsukanov, A. V.; Levitsky, V. G.

2026-05-17 bioinformatics 10.64898/2026.05.13.725009 medRxiv
Top 0.4%
15.1%

Motivation: Transcription factors (TFs) regulate gene expression by binding specific DNA sequences, which are commonly represented by motif models. Although position weight matrices (PWMs) remain the dominant motif representation, alternative models, such as Markov models, can capture interpositional dependencies and may provide higher predictive performance. However, existing motif comparison tools are designed mainly for PWMs or require motifs to be reduced to PWM/PPM representations. This creates a major bottleneck for comparing motifs represented by different model architectures. This limitation complicates the interpretation of de novo motif discovery results and hinders the systematic integration of diverse motif models into genomic analyses. Results: We present MIMOSA (Model-Independent Motif Similarity Assessment), a model-independent framework for direct comparison of TF binding site (TFBS) motifs regardless of their mathematical representation. MIMOSA assesses motif similarity by comparing calibrated recognition profiles produced by motifs of different models on the same DNA sequence set, rather than by comparing the motifs themselves. In a cross-database benchmark on HOCOMOCO motifs, MIMOSA achieved retrieval performance comparable to established PWM-oriented tools, including Tomtom and MACRO-APE, with MRR and Recall@k close to the best-performing methods. Pairwise ranking comparisons showed that MIMOSA captures a similarity signal consistent with existing approaches while providing a representation-independent comparison strategy. Application to de novo motifs derived from ChIP-seq data for the ATF3 TF demonstrated that recognition-profile comparison distinguished alternative spacer variants represented as separate PWMs from their integration within more flexible models such as BaMM and Slim. Thus, MIMOSA enables quantitative cross-model motif comparison and supports interpretation of motif heterogeneity in TFBS analyses. Availability and implementation: MIMOSA is implemented in Python and is freely available at https://github.com/ubercomrade/mimosa.

15
MOSAIC: Model-based, Subgroup-Aware Identification of Driver Mutations in Cancer

Campbell, K.; Reyna, M. A.

2026-05-03 bioinformatics 10.64898/2026.04.29.721672 medRxiv
Top 0.5%
14.6%

In cancer genomics, recurrent patterns of mutual exclusivity within a gene set can indicate shared biological context and involvement in tumorigenesis. However, existing methods are not designed to distinguish mutual exclusivity arising from meaningful biological interactions from mutual exclusivity driven by heterogeneity between underlying patient subpopulations. In this work, we introduce MOSAIC, a novel statistical framework that models patient subgroup heterogeneity in mutual exclusivity analyses. In experiments with simulated data and real data from The Cancer Genome Atlas, we show that MOSAIC amplifies subgroup-specific mutual exclusivity signals, including between IDH1 and IDH2 in young low-grade glioma patients, while reducing the effect of signals produced by underlying subgroup structures, such as distinct genomic lineages associated with histological subtypes of endometrial cancer. Finally, we demonstrate that MOSAIC is more powerful than existing p-value combination methods for patient subgroup stratification. MOSAIC is available as an open-source tool at https://github.com/reynalab/mosaic.

16
De novo protein discovery in non-model organisms

Ali, A.

2026-05-13 bioinformatics 10.64898/2026.05.08.723910 medRxiv
Top 0.5%
14.5%

We developed plant (Parallel Annotation of Transcriptomes), a de novo method that can potentially compare RNA-seq data of any two species without a reference genome. plant is conceptually similar to chromatography. In the same way a complex mixture is filtered to isolate its individual components, we applied a computational method to identify, annotate, and quantify components across transcriptomes. The comparison points are universal protein domain annotations rather than species-specific genes, as would be the case for a differential gene expression analysis. We looked at several Selaginella species via the 1000 Plant transcriptomes initiative (1KP) where RNA-seq data for various plant species have been made publicly available. The raw reads were assembled via Trinity. The assembled transcripts were then searched against the Pfam protein domain database via InterProScan. The assembled transcripts were also quantified via kallisto. By merging these two aspects, we were able to see how often a particular protein domain - a predicted protein structure - is expressed. These quantified annotations of protein domains are comparable across species, assuming a relatively short evolutionary distance. We were also able to identify the presence of species-specific protein domains and trace each annotation back to the gene. A bubble plot was created to visualize the distributions of Pfam annotations across species as well as GO terms.
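
The merging step described in this abstract, combining domain annotations with transcript abundances, might look roughly like the sketch below. It is not the authors' code: the function name, the in-memory dictionaries standing in for parsed InterProScan and kallisto output, the transcript IDs, and the choice to count each domain once per transcript are assumptions for illustration.

```python
from collections import defaultdict

def domain_expression(domain_hits, tpm):
    """Sum transcript abundance per protein domain.

    domain_hits: transcript ID -> list of Pfam accessions found on it
    tpm:         transcript ID -> abundance (e.g. TPM)
    A domain occurring several times on one transcript is counted once
    for that transcript (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for transcript, domains in domain_hits.items():
        for domain in set(domains):
            totals[domain] += tpm.get(transcript, 0.0)
    return dict(totals)

# Made-up example: three assembled transcripts, two annotated domains.
hits = {
    "t1": ["PF00069", "PF00069"],   # same domain found twice
    "t2": ["PF00069", "PF00400"],
    "t3": [],                       # no domain annotation
}
abundance = {"t1": 10.0, "t2": 5.0, "t3": 99.0}
per_domain = domain_expression(hits, abundance)
```

Running the same function on a second species yields a table keyed by the same Pfam accessions, which is what makes the two transcriptomes comparable without shared gene identifiers.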

17
Efficient Stochastic Trace Generation for Transcription

Ferdowsi, A.; Fuegger, M.; Nowak, T.

2026-05-08 bioinformatics 10.64898/2026.05.05.722871 medRxiv
Top 0.5%
14.5%

Bursty transcription in single cells typically produces over-dispersed, skewed, and sometimes heavy-tailed expression distributions that are explained by two-state Markov models of the promoters. While the gold standard for simulation is exact stochastic sampling with Gillespie's algorithm, obtaining thousands of timed traces is computationally costly. Surrogate models based on stochastic differential equations (SDEs) are widely used to speed up this simulation process. An example is the Chemical Langevin Equation based on Gaussian noise, which, however, does not capture heavy-tailed noise. In this work, we present a unified SDE framework that combines deterministic drift, Gaussian fluctuations, and additive sporadic jumps of arbitrary distributions, and provide an open-source Python implementation, bcrnnoise. The framework subsumes standard surrogate models and allows for vectorized generation of batches of transcription traces. We assess computational speed and accuracy of common surrogate models along with new models, showing that high accuracy can be obtained while reducing computational cost by up to two orders of magnitude.
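
The three ingredients named in this abstract (drift, Gaussian fluctuations, sporadic jumps) can be combined in a simple fixed-step scheme. The sketch below is not the bcrnnoise API and is not vectorized: it uses a plain Euler-Maruyama step with at most one jump per step, and every name and parameter value in it is an assumption chosen for illustration.

```python
import random

def simulate_trace(x0, drift, sigma, jump_rate, jump_sampler, dt, n_steps, rng):
    """Fixed-step simulation of dX = drift(X) dt + sigma(X) dW + jumps.

    jump_rate is the expected number of jumps per unit time; a jump occurs
    in a step with probability jump_rate * dt (valid only for small dt).
    The state is clipped at zero because it represents a molecule count.
    """
    x = x0
    trace = [x]
    for _ in range(n_steps):
        gaussian = sigma(x) * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        jump = jump_sampler(rng) if rng.random() < jump_rate * dt else 0.0
        x = max(0.0, x + drift(x) * dt + gaussian + jump)
        trace.append(x)
    return trace

# Hypothetical parameters: production 10, degradation 1 per unit time.
def production_minus_decay(x):
    return 10.0 - 1.0 * x

# Noise-free run: should relax to the fixed point 10 / 1 = 10.
deterministic = simulate_trace(
    0.0, production_minus_decay, lambda x: 0.0,
    0.0, lambda rng: 0.0, 0.01, 2000, random.Random(0),
)

# Noisy run with square-root Gaussian noise and exponential burst sizes.
noisy = simulate_trace(
    0.0, production_minus_decay, lambda x: x ** 0.5,
    0.5, lambda rng: rng.expovariate(0.2), 0.01, 2000, random.Random(1),
)
```

Swapping `jump_sampler` for a heavy-tailed distribution changes the burst statistics without touching the drift or Gaussian terms, which is the kind of flexibility the abstract attributes to the unified framework.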

18
Discriminative learning of substitution matrices and gap penalties for pairwise alignment of biological sequences

Ciach, M. A.; Zacharopoulou, E.; Startek, M. P.; Miasojedow, B.; Alexiou, P.

2026-05-18 bioinformatics 10.64898/2026.05.14.725168 medRxiv
Top 0.5%
14.4%

Pairwise alignment scores are used to classify pairs of sequences in many areas of bioinformatics, including homology search, interaction prediction, and read mapping. The relative scores of different pairs strongly depend on the choice of a substitution matrix and gap penalties, but existing approaches for estimating these parameters do not directly optimize them for the task of classification. In this work, we present DiscrimAlign, a statistical model for discriminative learning of substitution matrices and gap penalties from a dataset of positive and negative pairs of unaligned biological sequences. The model links the alignment score of a sequence pair with the associated binary label through a logistic function and learns the parameters by likelihood maximization. We analyze theoretical properties of the model, derive and implement a learning procedure, study its performance in simulated experiments, and apply it to predict microRNA-target interactions. We show that sequence alignment with discriminative substitution matrices and gap penalties predicts the interactions comparably to state-of-the-art neural network classifiers while being more interpretable. An implementation of the model and reproducibility workflows are available at https://github.com/BioGeMT/DiscrimAlign.
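
The link between alignment score and label described in this abstract can be written down in a few lines. The sketch below is not DiscrimAlign's implementation: it assumes a global (Needleman-Wunsch) alignment with a single match/mismatch score and a linear gap penalty, and it only evaluates the logistic log-likelihood for fixed parameters rather than learning them. All names and values are hypothetical.

```python
import math

def nw_score(a, b, match, mismatch, gap):
    """Optimal global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(pairs, labels, match, mismatch, gap, slope, intercept):
    """Sum over pairs of log P(label | score), with P(1) = sigmoid(slope*score + intercept)."""
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        z = slope * nw_score(a, b, match, mismatch, gap) + intercept
        total += log_sigmoid(z) if y == 1 else log_sigmoid(-z)
    return total

# Made-up data: one similar pair (label 1) and one dissimilar pair (label 0).
pairs = [("ACGT", "ACGT"), ("ACGT", "TTTT")]
ll_correct = log_likelihood(pairs, [1, 0], 1, -1, -2, 1.0, 0.0)
ll_flipped = log_likelihood(pairs, [0, 1], 1, -1, -2, 1.0, 0.0)
```

A learning procedure would search over the scoring parameters, the slope, and the intercept to maximize this quantity; how DiscrimAlign does so is described in the preprint, not here.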

19
Benchmarking long-read simulators against Oxford Nanopore whole-genome sequencing data

Taouk, M. L.; Ingle, D. J.; Wick, R. R.

2026-05-11 bioinformatics 10.64898/2026.05.06.723380 medRxiv
Top 0.5%
14.1%

Background: Oxford Nanopore Technologies (ONT) sequencing is increasingly used for whole-genome sequencing (WGS) across a wide range of applications. However, the platform has evolved rapidly through updates to flow cell chemistry and basecalling algorithms, altering the characteristics of the resulting sequencing data. Read simulators provide synthetic datasets with known ground truth, enabling controlled development and evaluation of methods. However, many existing simulators were developed for earlier versions of ONT sequencing or use generic long-read assumptions, and their realism for contemporary ONT data is unclear. Results: We benchmarked six ONT-compatible read simulators (Badread, LongISLND, lrsim, NanoSim, PBSIM3 and SimLoRD) using a microbial genome reference and ONT R10.4.1 reads as the empirical standard. Each tool was configured to maximise realism, including training on empirical reads when supported. We compared simulated and real datasets with respect to read length, read accuracy, FASTQ quality scores and sequence error profiles. No simulator reproduced all metrics of the real data well. PBSIM3 most closely reproduced read length, read accuracy and FASTQ quality scores, making it a strong simulator for broad read-level realism. However, it did not capture important features of the real error profile, including context-dependent substitution rates and homopolymer-length errors. Badread and LongISLND better reproduced some aspects of the error profile, but showed other departures from the real data. Conclusion: PBSIM3 is a good general-purpose choice for many ONT WGS simulation tasks because it reproduced several key read-level properties well. However, Badread or LongISLND may be preferable for applications where error structure is more important. No evaluated tool was realistic across all tested metrics, highlighting a gap for improved long-read simulators.

20
Heterogeneity-driven adaptive scale graph learning for subcellular spatial transcriptomics

Shi, W.; Shen, C.; Liu, Y.; Xiao, Q.; Luo, J.

2026-05-21 bioinformatics 10.64898/2026.05.19.726162 medRxiv
Top 0.6%
13.2%

Motivation: Spatial transcriptomics enables gene expression profiling within intact tissue sections, providing an important basis for analyzing tissue organization, cellular heterogeneity, and microenvironmental interactions. However, existing spatial structure identification methods often integrate spatial information using fixed neighborhoods or predefined smoothing scales, which limits their ability to adapt to region-specific structural heterogeneity. In homogeneous regions, broader spatial smoothing can help preserve continuous tissue structures, whereas in regions with complex boundaries or mixed cell populations, excessive smoothing may obscure local expression differences and fine-scale structural changes. Therefore, it is necessary to develop an adaptive graph learning framework that can adjust the range of spatial information integration according to tissue structural heterogeneity. Results: In this study, we propose HAST, a heterogeneity-driven adaptive-scale graph learning framework for spatial transcriptomics. HAST adaptively determines graph filtering scales according to spatial structural heterogeneity, enabling flexible information aggregation across different tissue regions. It further decomposes gene expression signals into low-frequency structural components and high-frequency residual components, thereby jointly modeling global spatial continuity and local expression variations. Experiments on high-resolution spatial transcriptomics datasets show that HAST improves spatial structure identification and cross-section generalization. Tumor-enriched cluster identification and neighborhood enrichment analysis further demonstrate its ability to characterize tumor-associated spatial regions and microenvironmental organization.
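
The general idea of smoothing less where the neighborhood is heterogeneous, then splitting a signal into a smooth part and a residual, can be shown on a toy graph. The sketch below is not HAST: the heterogeneity measure, the formula for the per-spot smoothing weight, and the single averaging step are assumptions invented for illustration, and the abstract does not specify how HAST computes any of them.

```python
def decompose(signal, neighbors, max_alpha=0.9):
    """Split a per-spot signal into low- and high-frequency parts.

    signal:    list of expression values, one per spot
    neighbors: spot index -> list of neighboring spot indices
    Each spot is blended with its neighborhood mean using a weight alpha
    that shrinks as local heterogeneity grows (hypothetical rule).
    """
    low = []
    for i, x in enumerate(signal):
        nbr_vals = [signal[j] for j in neighbors[i]]
        if not nbr_vals:
            low.append(x)  # isolated spot: nothing to smooth with
            continue
        mean = sum(nbr_vals) / len(nbr_vals)
        heterogeneity = sum((v - x) ** 2 for v in nbr_vals) / len(nbr_vals)
        alpha = max_alpha / (1.0 + heterogeneity)
        low.append((1.0 - alpha) * x + alpha * mean)
    high = [x - l for x, l in zip(signal, low)]
    return low, high

# Toy chain of four spots with a sharp boundary before the last one.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
values = [1.0, 1.0, 1.0, 5.0]
low_part, high_part = decompose(values, chain)

# A perfectly homogeneous region has no high-frequency residual.
flat_low, flat_high = decompose([2.0, 2.0, 2.0], {0: [1], 1: [0, 2], 2: [1]})
```

In the toy chain, the spot at the boundary is pulled only slightly toward its neighbor, so the sharp change survives in the low-frequency part instead of being smoothed away.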